Revert "[Bugfix] Use heartbeats instead of health checks (#8583)" Test by MitchLewis930 · Pull Request #16 · Signal65/vllm-code-review

MitchLewis930 · 2026-01-21T18:11:09Z

This reverts commit 6e0c9d6.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Summary by CodeRabbit

New Features
- Added a dedicated health-check mechanism for improved engine monitoring and diagnostics.
Bug Fixes
- Enhanced error handling to properly report and propagate engine failures during request processing.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…t#8583)" This reverts commit 6e0c9d6.

coderabbitai · 2026-01-21T18:12:06Z

📝 Walkthrough

Walkthrough

The changes replace a heartbeat-based health signaling mechanism with a dedicated RPC-driven health-check system. A new RPCHealthRequest type is introduced; the client's heartbeat socket and loop are renamed to health-check equivalents; and the engine removes heartbeat thread logic in favor of reactive health request handling. Test timing is adjusted accordingly.

Changes

Cohort / File(s)	Summary
RPC Type Addition `vllm/engine/multiprocessing/__init__.py`	Introduces `RPCHealthRequest` class (empty) and adds it to `RPC_REQUEST_T` union type for new health-check RPC mechanism.
Client Health Mechanism `vllm/engine/multiprocessing/client.py`	Replaces `heartbeat_socket` with `health_socket`, renames `run_heartbeat_loop()` to `run_check_health_loop()`, updates health-check flow to send `RPCHealthRequest` when idle and await acknowledgments.
Engine Health Mechanism `vllm/engine/multiprocessing/engine.py`	Removes heartbeat thread/loop logic, adds `RPCHealthRequest` handler, replaces heartbeat signaling with health socket `HEALTHY`/`UNHEALTHY` messages, adds explicit error-handling in engine step to emit `RPCError` through outputs.
Test Timing & Payload `tests/mq_llm_engine/test_error_handling.py`	Introduces background abort task (2-second delay), expands `SamplingParams(max_tokens)` from 10 to 2000, adds synchronization to await abort task completion.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Engine
    participant HealthChannel

    Client->>HealthChannel: Poll health_socket (idle)
    alt Health Check Needed
        Client->>Engine: Send RPCHealthRequest
        Engine->>Engine: Receive & process health request
        Engine->>Engine: Call engine.check_health()
        alt Engine Healthy
            Engine->>HealthChannel: Send HEALTHY via health_socket
        else Engine Unhealthy
            Engine->>HealthChannel: Send UNHEALTHY via health_socket
        end
        HealthChannel->>Client: Receive health status
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Heartbeats fade, health checks bloom bright,
No background pulse through the night,
RPC whispers when we need to know,
Are you well, dear engine? Let's go! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title describes reverting a heartbeat/health-check bugfix, which directly aligns with the changeset that replaces heartbeat logic with health checks throughout multiple files (client, engine, and init).

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@vllm/engine/multiprocessing/engine.py`:
- Around line 282-288: In _handle_health_request, when self._errored_with is set
you currently call self._send_unhealthy(self._errored_with) but then continue to
run self.engine.check_health() and self._send_healthy(), which can mask the
error; fix by returning immediately after calling
self._send_unhealthy(self._errored_with) so that _handle_health_request does not
invoke engine.check_health() or _send_healthy in the errored path (update the
_handle_health_request method to short-circuit on self._errored_with).

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e0c9d6 and ce916ce.

📒 Files selected for processing (4)

tests/mq_llm_engine/test_error_handling.py
vllm/engine/multiprocessing/__init__.py
vllm/engine/multiprocessing/client.py
vllm/engine/multiprocessing/engine.py

🧰 Additional context used

🧬 Code graph analysis (1)

vllm/engine/multiprocessing/client.py (1)

vllm/engine/multiprocessing/__init__.py (5)

RPCError (35-38)

RPCHealthRequest (46-47)

RPCProcessRequest (25-31)

RPCStartupRequest (50-51)

RPCStartupResponse (55-56)

🔇 Additional comments (7)

tests/mq_llm_engine/test_error_handling.py (1)

156-175: Delayed abort synchronization improves test determinism.

Scheduling the abort after a short delay and awaiting the task keeps the abort within the generation window and avoids races with the subsequent health check; the larger max_tokens supports that.

vllm/engine/multiprocessing/__init__.py (1)

46-60: RPCHealthRequest integration looks consistent.

Adding the type and extending RPC_REQUEST_T aligns the public protocol with the new health-check RPC flow.

vllm/engine/multiprocessing/client.py (1)

19-25: Health-check loop wiring looks coherent.

The new health_socket, proactive RPCHealthRequest probes, and setup launch of the check loop fit together cleanly.

Also applies to: 98-101, 127-155, 236-237

vllm/engine/multiprocessing/engine.py (4)

14-20: Health socket wiring aligns with the new health channel.

Importing RPCHealthRequest and binding a dedicated health socket match the client-side health path.

Also applies to: 94-96

201-214: RPCError emission on engine_step exceptions is solid.

Sending RPCError before re-raising ensures clients see the root failure.

232-235: Health requests are now handled in the input loop.

Routing RPCHealthRequest through _handle_health_request keeps health checks responsive.

297-303: Health responses on the health socket look correct.

Using the dedicated health socket for healthy/unhealthy responses matches the client’s poll loop.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

coderabbitai · 2026-01-21T18:20:52Z

vllm/engine/multiprocessing/engine.py

+    def _handle_health_request(self):
        if self._errored_with is not None:
            self._send_unhealthy(self._errored_with)

-        # Check for life of the main loop
-        elif time.time() - self._last_alive_time > self.last_alive_threshold:
-            self._send_unhealthy(RuntimeError("Engine loop has died"))
-
-        else:
-            # Otherwise- check health of the engine
-            # self.engine.check_health() raises on unhealthy
-            try:
-                self.engine.check_health()
-                self._send_healthy()
-            except Exception as e:
-                self._set_errored(e)
-                self._send_unhealthy(e)
+        # Raises error if unhealthy.
+        self.engine.check_health()
+        self._send_healthy()


⚠️ Potential issue | 🟠 Major

Return after reporting unhealthy state.

When _errored_with is set, the method sends UNHEALTHY but then still runs check_health() and can send HEALTHY, which can mask failures.

🔧 Proposed fix

def _handle_health_request(self): if self._errored_with is not None: self._send_unhealthy(self._errored_with) + return # Raises error if unhealthy. self.engine.check_health() self._send_healthy()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _handle_health_request(self):

if self._errored_with is not None:

self._send_unhealthy(self._errored_with)

# Check for life of the main loop

elif time.time() - self._last_alive_time > self.last_alive_threshold:

self._send_unhealthy(RuntimeError("Engine loop has died"))

else:

# Otherwise- check health of the engine

# self.engine.check_health() raises on unhealthy

try:

self.engine.check_health()

self._send_healthy()

except Exception as e:

self._set_errored(e)

self._send_unhealthy(e)

# Raises error if unhealthy.

self.engine.check_health()

self._send_healthy()

def _handle_health_request(self):

if self._errored_with is not None:

self._send_unhealthy(self._errored_with)

return

# Raises error if unhealthy.

self.engine.check_health()

self._send_healthy()

🤖 Prompt for AI Agents

In `@vllm/engine/multiprocessing/engine.py` around lines 282 - 288, In _handle_health_request, when self._errored_with is set you currently call self._send_unhealthy(self._errored_with) but then continue to run self.engine.check_health() and self._send_healthy(), which can mask the error; fix by returning immediately after calling self._send_unhealthy(self._errored_with) so that _handle_health_request does not invoke engine.check_health() or _send_healthy in the errored path (update the _handle_health_request method to short-circuit on self._errored_with).

Revert "[Bugfix] Use heartbeats instead of health checks (vllm-projec…

ce916ce

…t#8583)" This reverts commit 6e0c9d6.

coderabbitai bot reviewed Jan 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Revert "[Bugfix] Use heartbeats instead of health checks (#8583)" Test#16

Revert "[Bugfix] Use heartbeats instead of health checks (#8583)" Test#16
MitchLewis930 wants to merge 1 commit intobench/PR_002_basefrom
bench/PR_002_bug__CodeRabbit

MitchLewis930 commented Jan 21, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Jan 21, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Jan 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

MitchLewis930 commented Jan 21, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

MitchLewis930 commented Jan 21, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 21, 2026 •

edited

Loading